ABSTRACT

A potentially valuable measure of overconfidence on
probabilistic multiple-choice tests was evaluated. The measure of
overconfidence was based on probabilistic responses to nonsense items
embedded in a vocabulary test. The test was administered under both
confidence response and conventional choice response directions to
208 undergraduate educational psychology students. Measures of
vocabulary knowledge based on confidence and choice responses,
overconfidence, and risk-taking propensity were obtained. The results
indicated that overconfidence was significantly related in a negative
direction to probabilistic vocabulary scores. A moderate correlation
was found between overconfidence and risk-taking propensity. However,
the scatter plot for these measures showed that this relationship may
have been spurious. (Author)
OVER-CONFIDENCE ON PROBABILISTIC TESTS

ROGER A. KOEHLER 
University of Nebraska 


Probabilistic or confidence testing has been recommended (e.g., Shuford,
Albert, & Massengill, 1966) as a more reliable and more valid response procedure
for objective examinations than the conventional choice-response method. Through
the assignment of subjective probabilities as to the attractiveness of each item
alternative, probabilistic testing is designed to remove the guessing factor
from objective tests, and to also measure various degrees of partial information.
Numerous procedures for obtaining test scores based on confidence assignments have
been proposed. However, Shuford et al. have suggested that probabilistic responses
will yield maximal expected scores if and only if a "reproducing" scoring function
is utilized. The use of such scoring functions encourages examinees to be "honest"
in their expression of subjective probabilities by yielding a severe penalty when
high confidence is assigned to an incorrect alternative.

The literature (e.g., Rippey, 1970; Romberg & Shepler, 1968; Hambleton,
Roberts & Traub, 1970; Koehler, 1971) indicates no consistent trends with respect
to the improvement of test reliability and/or validity through probabilistic
response procedures. De Finetti (1965) suggested that the success of probabilistic
response procedures is dependent upon examinee understanding of the item scoring
function and the expected pay-offs under various degrees of uncertainty. Perhaps
the conflicting results of previous studies are partially attributable to a lack
of adequate training in confidence response methodology. If, through extensive
training, higher reliabilities can be obtained when confidence tests are used
in place of conventional choice tests, an important question remains unanswered.
What produces the increased reliabilities? Are they due to the more precise nature
of measurement through subjective probability assignment, or could such increases
occur as a result of reliably measuring some dimension or trait in addition to the
trait the test was designed to measure? If an affirmative answer was given to the
latter question, one would have a difficult time arguing in favor of confidence
testing procedures.

The purposes of the present study, therefore, were to develop a measure of
"over-confidence" on probabilistic tests, to assess the measurement characteristics
of such a measure, and to investigate the relationship of over-confidence on tests
to knowledge and to risk-taking propensity. Several authors (e.g., Echternacht,
1972; Stanley & Wang, 1970; Hansen, 1971) have implied that over and/or under
confidence expressed by examinees responding to test items through confidence
marking procedures can be equated to risk-taking propensity.

METHOD

The experimental instrument for this study was a 40 item multiple-choice
vocabulary test. Embedded within these 40 items were seven nonsense items,
where a nonsense item is defined as an item that has neither a correct nor an
incorrect answer. An example of a nonsense item on the test is as follows:

22. Bilious:   sad ___   double ___   greedy ___   bitter ___



Since "Bilious" has no meaning in the English language, the above item has no best
(correct) answer and no incorrect answer.
Vocabulary Measures: 

Vocabulary scores on the 33 legitimate test items were obtained from both a
confidence assignment administration and a conventional choice administration that
employed do-not-guess directions. The confidence response directions requested
examinees to assign their percent of confidence in each alternative to the
nearest hundredth, making sure their confidence for all alternatives of an item
summed to 100 percent. The vocabulary score for item (j), based on the confidence
marking directions, was obtained by each of the three "reproducing" scoring
functions below:

    S_{1j} = 2P_k - \sum_{i=1}^{m} P_i^2                          (quadratic)

    S_{2j} = (2 + \log_{10} P_k)/2     for .01 < P_k \le 1
           = 0                         for 0 \le P_k \le .01      (logarithmic)

    S_{3j} = P_k / ( \sum_{i=1}^{m} P_i^2 )^{1/2}                 (spherical)

where P_i is the probability (confidence) assigned to alternative i, P_k is the
confidence expressed in the keyed alternative, and m is the number of alternatives
per item. Total confidence response vocabulary scores (S1, S2, S3) were
calculated by summing the above item scores over all items. Choice responses to the
33 legitimate vocabulary items administered under conventional do-not-guess
directions yielded number right scores (L) and "corrected for guessing" scores (G).
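
The three item scoring rules can be illustrated with a minimal Python sketch. The function names and the example confidence vectors are hypothetical. The logarithmic rule is assumed here to take the truncated form (2 + log10 P_k)/2 when P_k exceeds .01 and zero otherwise; the quadratic and spherical rules follow the definitions given above.

```python
import math

def quadratic_score(p, k):
    """Quadratic rule: twice the keyed confidence minus the sum of squared confidences."""
    return 2 * p[k] - sum(pi ** 2 for pi in p)

def logarithmic_score(p, k):
    """Truncated logarithmic rule (assumed form): (2 + log10 P_k) / 2, or 0 when P_k <= .01."""
    if p[k] <= 0.01:
        return 0.0
    return (2 + math.log10(p[k])) / 2

def spherical_score(p, k):
    """Spherical rule: keyed confidence divided by the Euclidean length of the confidence vector."""
    return p[k] / math.sqrt(sum(pi ** 2 for pi in p))

# Hypothetical responses to one four-alternative item; index 0 is the keyed alternative.
examples = {
    "confident and right": [0.70, 0.10, 0.10, 0.10],
    "no information": [0.25, 0.25, 0.25, 0.25],
    "confident and wrong": [0.10, 0.70, 0.10, 0.10],
}

for label, p in examples.items():
    scores = (quadratic_score(p, 0), logarithmic_score(p, 0), spherical_score(p, 0))
    print(label, [round(s, 3) for s in scores])
```

Under all three rules the "confident and wrong" response scores below the "no information" response, which is the penalty that is meant to discourage overstated confidence.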

Over-Confidence Measures:

The measure of over-confidence was based on confidence responses to the seven
nonsense vocabulary items, where the over-confidence for nonsense item (j) was
determined by the formula:

    C_{4j} = \sum_{i=1}^{m} (P_i - 1/m)^2 / (1 - 1/m).

C4j ranges from a low of zero (equal probability assigned to each alternative) to a
high of one (total confidence assigned to a single alternative). The total
over-confidence (C4) expressed by an examinee was the sum of the C4j values on the
seven nonsense items. For comparative purposes, an additional measure of confidence
developed by Hansen (1971) and based on probabilistic responses to legitimate
items was calculated as:

    C_1 = C_x - \hat{C}_x.

In the latter formula,

    C_x = (1/33) \sum_{j=1}^{33} [ m/(2(m-1)) \sum_{i=1}^{m} |1/m - P_i| ]

and \hat{C}_x is the linear estimate of C_x using S1 as a predictor variable. Cx is a
measure of the degree of certainty expressed through confidence responses to
legitimate test items. The procedure for determining C1 was also employed with
S2 and S3 as predictors and yielded two additional confidence measures, C2 and C3
respectively.
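
As a rough illustration of the two confidence indices, the sketch below computes the item-level over-confidence index (squared deviations from 1/m, divided by 1 - 1/m) and the item-level certainty term of Hansen's measure (absolute deviations from 1/m, scaled by m/(2(m-1))). Function names and the sample responses are hypothetical.

```python
def overconfidence_item(p):
    """Item over-confidence: 0 for equal confidences, 1 for total confidence in one alternative."""
    m = len(p)
    return sum((pi - 1 / m) ** 2 for pi in p) / (1 - 1 / m)

def total_overconfidence(nonsense_responses):
    """Sum of item over-confidence values over the nonsense items."""
    return sum(overconfidence_item(p) for p in nonsense_responses)

def certainty_item(p):
    """Item certainty: scaled sum of absolute deviations from equal confidence."""
    m = len(p)
    return m / (2 * (m - 1)) * sum(abs(1 / m - pi) for pi in p)

def mean_certainty(legitimate_responses):
    """Average item certainty over the legitimate items."""
    return sum(certainty_item(p) for p in legitimate_responses) / len(legitimate_responses)

# Hypothetical confidence vectors for four-alternative items.
equal = [0.25, 0.25, 0.25, 0.25]
certain = [1.0, 0.0, 0.0, 0.0]
print(overconfidence_item(equal), overconfidence_item(certain))
print(certainty_item(equal), certainty_item(certain))
```

Both indices share the same endpoints, zero for equal confidences and one for total confidence in a single alternative, but they differ in between because one squares the deviations and the other takes absolute values.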

Finally, a measure of risk-taking propensity (R) was calculated as the
proportion of nonsense items attempted when the vocabulary test described above was
administered under conventional do-not-guess directions. This risk measure
has been extensively used in research (e.g., Slakter, 1967, 1968a, 1968b, 1969;
Slakter & Koehler, 1968) and has yielded high reliabilities for very few
nonsense items.
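
The risk measure amounts to a simple proportion. A minimal sketch follows, assuming omitted items are coded as None; that coding and the function name are hypothetical.

```python
def risk_taking(nonsense_choices):
    """Proportion of nonsense items attempted; None marks an item left blank."""
    attempted = sum(1 for choice in nonsense_choices if choice is not None)
    return attempted / len(nonsense_choices)

# Hypothetical examinee who answered three of the seven nonsense items.
print(risk_taking(["b", None, "a", None, None, "d", None]))
```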

A summary of the total scores derived from the two administrations (confidence
response and conventional do-not-guess) of the 40 item vocabulary test is
listed below:

1) over-confidence on nonsense items (C4)
2) quadratic vocabulary score (S1)
3) logarithmic vocabulary score (S2)
4) spherical vocabulary score (S3)
5) number right vocabulary score (L)
6) "corrected for guessing" vocabulary score (G)
7) risk-taking propensity (R)
8) residual confidence-partialling S1 from Cx (C1)
9) residual confidence-partialling S2 from Cx (C2)
10) residual confidence-partialling S3 from Cx (C3)

Ss for the study were all available students enrolled in an undergraduate
educational psychology course; the sample totaled 208 students. Testing sessions
for all Ss went as follows:

1. A training booklet was administered to teach Ss how to respond under
confidence marking directions. The training booklet was designed
specifically to help Ss become familiar with the following:

a) the confidence response procedure

b) the logarithmic scoring function (S1)

c) the pay-offs for responding in various manners under
several degrees of uncertainty (i.e., illustrations
pertaining to the severity of the penalties assessed
for expressing high confidence in incorrect alternatives
were presented).

Training was provided only for scoring function S1 in order to test the
conjecture that scoring function familiarity and expected pay-offs are
necessary for the success of confidence marking methods. Four contrived
vocabulary items were placed at the end of the training booklet for the
purpose of evaluating the success of the booklet.

2. The vocabulary test was administered through confidence response
directions. At the completion of this administration all test booklets
were collected.

3. The vocabulary test with a random reordering of items was administered
under conventional do-not-guess directions.

RESULTS AND DISCUSSION 

An investigation of the responses to the four contrived vocabulary items at
the end of the training booklet provided evidence that Ss understood the
confidence marking procedure. For the extremely simple item, Ss assigned 100 percent
confidence to the keyed alternative; for the very difficult item, most Ss equally
distributed their confidence among the alternatives; and for the other two items,
Ss appeared to distribute their confidence in the expected percentages.

Table 1 presents the means, standard deviations, and coefficient alpha 
reliability estimates for all scores obtained in the study. 



Insert Table 1 about here 



An inspection of Table 1 indicates that training in the use of the S1 scoring
function did not yield higher reliabilities for that function over the S2 or S3
functions. In addition, reliabilities of confidence response scores were generally
about the same as those for the L and G conventional response scores (the largest
difference occurred between L and S2 scores, .85 versus .74 respectively). While
the reliability for the confidence measure C4 was not what one might desire,
it must be remembered that C4 is based on only seven items. Using the
Spearman-Brown Prophecy formula, a set of 33 nonsense items should yield a
reliability of .86 for C4, which is comparable to the reliabilities of the other
scores obtained in the study. A reliability of .87 for R, which is also based on
the seven nonsense items, is consistent with previous research on this risk
measure (e.g., Slakter, 1969; Slakter & Cramer, 1969; Slakter & Koehler, 1968).
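
The Spearman-Brown projection can be checked with a short sketch. It takes the seven-item C4 reliability of 0.57 reported in Table 1 and lengthens the scale by a factor of 33/7; the function name is hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Projected reliability when a test is lengthened by length_factor."""
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)

# Seven nonsense items lengthened to 33 items.
projected = spearman_brown(0.57, 33 / 7)
print(round(projected, 2))  # 0.86
```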

Note that the mean of the C4 confidence measure is quite low (i.e., only
0.70 when the maximum possible C4 is 7.00). Two factors may have contributed
to this low C4 mean. First, the overall test was very difficult (mean of number
right scores was only 14.00 of a possible 33 points). This general difficulty
may have forced more Ss into a conservative response position. Secondly, the
formula upon which C4 is calculated is biased toward the low end of the (0-1)
interval; i.e., as confidence increases, C4j increases at a much slower rate.
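
The compression toward the low end can be seen by evaluating the item over-confidence index for increasing confidence in one alternative. The sketch below assumes a four-alternative item and, as a simplification not stated in the study, that the remaining confidence is spread equally over the other alternatives. Function names are hypothetical.

```python
def overconfidence_item(p):
    """Item over-confidence: squared deviations from 1/m, divided by (1 - 1/m)."""
    m = len(p)
    return sum((pi - 1 / m) ** 2 for pi in p) / (1 - 1 / m)

def spread(top, m=4):
    """Confidence vector with `top` on one alternative and the rest shared equally (assumption)."""
    rest = (1 - top) / (m - 1)
    return [top] + [rest] * (m - 1)

for top in (0.25, 0.40, 0.55, 0.70, 0.85, 1.00):
    print(f"{top:.2f} -> {overconfidence_item(spread(top)):.2f}")
```

Under these assumptions the index equals the square of (top - 1/m)/(1 - 1/m), so 55 percent confidence in one alternative gives only 0.16 and 70 percent gives 0.36.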
Table 2 contains the intercorrelations among all scores obtained through
both the confidence and the conventional test administrations.



Insert Table 2 about here 



As would be expected, the correlations among the vocabulary test scores S1, S2,
S3, L, and G were generally high; about the same magnitude as the reliabilities.
The vocabulary scores obtained under confidence response directions
(S1, S2, and S3) were significantly (α < .01) correlated in a negative direction
with both over-confidence (C4) and risk-taking propensity (R). This latter finding
implies that confidence response vocabulary scores tend to be lower for Ss who are
overly confident of their responses or who possess a high propensity for taking
risks. Since Ss vary with respect to their confidence expression and/or
risk-taking behavior, probabilistic testing methods appear to confound knowledge
with these two personality traits. Although "corrected for guessing" (G) scores
were also significantly (α < .01) related to confidence expression, the strength of
association (r² = .04) was somewhat less than that of the C4 - S1 (r² = .16),
C4 - S2 (r² = .20), and C4 - S3 (r² = .10) relationships. If a testing method were
to be chosen based on the above results, the conventional testing method using
number-right (L) scores would appear to be the most valid procedure, since this
testing method yields vocabulary scores that are essentially unrelated to
over-confidence and risk-taking propensity.
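
The strength-of-association figures are squared correlations. A small sketch using the C4 column of Table 2 reproduces them; the dictionary layout and function name are illustrative only.

```python
def shared_variance(r):
    """Proportion of variance shared by two scores: the squared correlation."""
    return r ** 2

# Correlations of each vocabulary score with C4, as listed in Table 2.
correlations_with_c4 = {"S1": -0.40, "S2": -0.45, "S3": -0.31, "L": -0.15, "G": -0.21}

strength = {name: round(shared_variance(r), 2) for name, r in correlations_with_c4.items()}
print(strength)
```

Number-right scores share the least variance with over-confidence, which is the basis for preferring them in the paragraph above.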

It is interesting to note that confidence measures C1, C2, and C3 correlated
positively (significant at the .01 level) with L scores and G scores. In fact,
several correlations between these legitimate item confidence measures and
conventional vocabulary scores were of the same magnitude as the correlations
between legitimate item confidence measures and the nonsense item confidence
measure (C4). Perhaps the above finding indicates that the linear regression
procedure used to obtain C1, C2, and C3 was not totally successful in removing the
knowledge dimension from the Cx scores. Using S1, S2, or S3 to partial the
knowledge dimension from Cx scores may not be entirely valid. If over-confidence
does account for a portion of the variation in confidence response vocabulary
scores, both knowledge variation and confidence variation are removed by the
linear regression procedure.
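
The residual measures rest on an ordinary least-squares regression of the certainty score on a vocabulary score. The sketch below, with invented data and hypothetical function names, shows the mechanics: the residual is uncorrelated with the predictor by construction, so any over-confidence variance carried by the predictor is removed together with the knowledge variance.

```python
import math

def residuals(y, x):
    """Residuals of y after a simple least-squares regression on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

def pearson(a, b):
    """Pearson correlation between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)

# Invented scores for six examinees (not data from the study).
s1 = [10.2, 14.5, 8.1, 19.0, 12.3, 16.7]   # confidence-response vocabulary score
cx = [0.35, 0.52, 0.30, 0.71, 0.40, 0.66]  # certainty on legitimate items

c1 = residuals(cx, s1)
print([round(v, 3) for v in c1])
print(abs(pearson(c1, s1)) < 1e-9)  # residual is uncorrelated with the predictor
```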

With respect to the relationship between risk-taking propensity and
over-confidence, the present study indicates a moderate (significant at the .01
level) relationship between C4 and R, and essentially zero relationship when C1,
C2, and C3 are compared to R. The relationship between C4 and R may be of a
spurious nature, since an inspection of the scatter diagram for these variables
revealed a rather skewed distribution for C4 scores. Most C4 scores ranged between
zero and two, while only a very few relatively high risk takers scored greater than
2.5 on the C4 measure. In addition, the relationship between C4 and R may be
attributed to the fact that these two measures are based on the same few nonsense
items. Based on the above relationships, it would appear that over-confidence
and risk-taking propensity are not identical traits as previous authors have
suggested.

Since the reliability of the C4 scores was not as high as the reliabilities
of the other scores generated in this study (see Table 1), estimates of the
correlations between C4 and the other scores assuming all measures to be perfectly
reliable were calculated and are presented in row one of Table 2. In most cases,
these estimates tend to support the conclusions made previously.
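
The correction for attenuation divides an observed correlation by the square root of the product of the two reliabilities. A brief sketch using values from Tables 1 and 2 (C4 reliability 0.57; L reliability 0.85 and G reliability 0.82; observed correlations of -.15 and -.21) reproduces the corresponding parenthesized entries in Table 2. The function name is hypothetical.

```python
import math

def correct_for_attenuation(r_xy, reliability_x, reliability_y):
    """Estimated correlation between two measures if both were perfectly reliable."""
    return r_xy / math.sqrt(reliability_x * reliability_y)

print(round(correct_for_attenuation(-0.15, 0.57, 0.85), 2))  # -0.22
print(round(correct_for_attenuation(-0.21, 0.57, 0.82), 2))  # -0.31
```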

The results presented above are subject to the limitations inherent in
this study. The most serious limitation involves the experimental instrument
(vocabulary test) that was used to assess knowledge and confidence. Since a
vocabulary test bore little relationship to the objectives of the educational
psychology course from which Ss were obtained, there may have been minimal
incentive for Ss to be completely honest in their expressions of confidence.
Therefore, further research regarding the problem described in this study
should be performed using grade dependent course examinations.

In summary, the present study describes a potentially valuable disguised
measure of over-confidence on objective examinations. This measure, which indicates
the degree of confidence a subject possesses over and above that which is due to
subject matter knowledge (vocabulary knowledge), was significantly related to
probabilistically derived test scores and less highly related to number right
conventional test scores. It would appear, therefore, that confidence responding
methods produce variability in scores that cannot be attributed to knowledge of
subject matter (in this study, vocabulary). If these findings could be generalized
to all types of objective tests administered under confidence response directions,
one could not recommend such response methods as reasonable alternatives to the
conventional rights-only procedure.

In addition, the present study indicates that over-confidence in one's
responses to vocabulary test items is not identical to one's propensity to take
risks on such test items. The measure of over-confidence described in this study
was only moderately related to a measure of risk-taking propensity, and this
relationship may have been of a spurious nature.

Further research is necessary to investigate possible relationships between
the disguised confidence measure described here and various personality traits
of examinees.
Table 1

MEANS, STANDARD DEVIATIONS, AND RELIABILITIES OF ALL SCORES
N = 208

SCORE                     MEAN     STANDARD DEVIATION     RELIABILITY
Confidence (C4)            0.70     0.75                   0.57
Quadratic (S1)             --       --                     0.80
Logarithmic (S2)           --       2.97                   0.74
Spherical (S3)             --       --                     --
No. Right (L)             14.00     6.18                   0.85
Corrected (G)             10.65     6.95                   0.82
Risk (R)                   0.39     0.36                   0.87
Residuals for S1 (C1)      0.00     0.16                   0.82*
Residuals for S2 (C2)      0.00     0.17                   0.86*
Residuals for S3 (C3)      0.00     0.14                   0.74*

*reliabilities of linear combinations

Table 2

CORRELATIONS AMONG VARIOUS SCORES
N = 208

SCORES                    C4      S1      S2      S3      L       G       R       C1      C2      C3
Confidence (C4)           --    (-.59)* (-.69)  (-.45)  (-.22)  (-.31)  (.46)   (.66)   (.54)   (.77)
Quadratic (S1)           -.40
Logarithmic (S2)         -.45    .97
Spherical (S3)           -.31    .98     .94
No. Right (L)            -.15    .79     .71     .83
Corrected (G)            -.21    .88     .80     .91     .96
Risk (R)                  .32   -.32    -.32    -.32     .06    -.17
Residuals for S1 (C1)     .45                            .36     .31     .07
Residuals for S2 (C2)     .38                            .48     .44     .02
Residuals for S3 (C3)     .50                            .24     .17     .12

*Values in parentheses were determined by using the "correction for attenuation,"
since C4 scores had lower reliability than the other scores.



